The Evolution of Multimodal LLM Architectures: From Vision-Centric to Multi-Sensory Fusion
AI012 · Lesson 7


The development of multimodal large language models (MLLMs) marks a shift from closed, single-modality systems toward a unified representation space, in which non-text signals (images, audio, 3D) are translated into a language the LLM can understand.

1. From Vision to Multi-Sensory

  • Early MLLMs: focused primarily on Vision Transformers (ViT) for image-text tasks.
  • Modern architectures: integrate audio (e.g., HuBERT, Whisper) and 3D point clouds (e.g., Point-BERT) to achieve true cross-modal intelligence.

2. Projection Bridging

To connect different modalities to the LLM, a mathematical bridge is required:

  • Linear projection: a simple mapping used in early models (e.g., MiniGPT-4).
    $$X_{llm} = W \cdot X_{modality} + b$$
  • Multi-layer MLP: a two-layer structure (e.g., LLaVA-1.5) that uses nonlinear transformations to better align complex features.
  • Resamplers/Abstractors: advanced modules such as the Perceiver Resampler (Flamingo) or the Q-Former, which compress high-dimensional inputs into a fixed number of tokens.
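The two simpler bridges can be sketched in a few lines of numpy. The widths below (a 512-dim encoder, a 4096-dim LLM, 196 patches) and the random weights are illustrative placeholders, not any real checkpoint:

```python
import numpy as np

rng = np.random.default_rng(0)
D_MOD, D_LLM = 512, 4096   # hypothetical encoder / LLM hidden widths

def linear_projection(x, W, b):
    """MiniGPT-4-style bridge: a single affine map X_llm = W·x + b."""
    return x @ W + b

def mlp_projection(x, W1, b1, W2, b2):
    """LLaVA-1.5-style bridge: two layers with a GELU nonlinearity."""
    h = x @ W1 + b1
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))  # tanh GELU
    return h @ W2 + b2

# 196 patch features from a ViT-style encoder (random stand-ins)
patches = rng.standard_normal((196, D_MOD))

W = rng.standard_normal((D_MOD, D_LLM)) * 0.02
b = np.zeros(D_LLM)
tokens = linear_projection(patches, W, b)
print(tokens.shape)  # (196, 4096): one LLM-space token per patch
```

Note that both bridges keep one output token per input feature; a resampler such as the Q-Former would instead emit a fixed number of tokens regardless of input length.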

3. Decoding Strategies

  • Discrete tokens: represent outputs as entries in a dedicated vocabulary (e.g., VideoPoet).
  • Continuous embeddings: use "soft" signals to condition a dedicated downstream generator (e.g., NExT-GPT).
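The two output styles can be contrasted in a toy sketch. The codebook, its size, and the embedding width below are illustrative stand-ins, not any real model's vocabulary:

```python
import numpy as np

rng = np.random.default_rng(1)
codebook = rng.standard_normal((1024, 64))  # toy VQ codebook: 1024 entries, 64-dim

def to_discrete_tokens(embeddings, codebook):
    """VideoPoet-style: snap each output embedding to its nearest codebook id."""
    dists = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)        # integer token ids, one per embedding

def to_continuous_signal(embeddings):
    """NExT-GPT-style: pass the raw embeddings on as conditioning vectors."""
    return embeddings                  # consumed directly by a downstream generator

outputs = rng.standard_normal((8, 64))     # 8 "modality signal" embeddings
ids = to_discrete_tokens(outputs, codebook)
print(ids.shape)  # (8,)
```

Discrete ids can be trained with an ordinary next-token loss, while continuous signals avoid quantization error at the cost of needing a generator that accepts soft conditioning.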
The Projection Rule

For an LLM to process sound or 3D objects, the signal must be projected into the LLM's existing semantic space, so that it is interpreted as a "modality signal" rather than as noise.
Question 1
Which projection technique is generally considered superior to a simple Linear layer for complex modality alignment?

  • Token Dropping
  • Two-layer MLP or Resamplers (e.g., Q-Former)
  • Softmax Activation
  • Linear Projection
Question 2
What is the primary role of ImageBind or LanguageBind in this architecture?

  • To generate text from images
  • To compress video files
  • To create a Unified/Joint representation space for multiple modalities
  • To increase the LLM context window
Challenge: Designing an Any-to-Any System
Diagram the flow for an MLLM that takes an Audio input and generates a 3D model.
You are tasked with architecting a pipeline that allows an LLM to "listen" to an audio description and output a corresponding 3D object. Define the three critical steps in this pipeline.
Step 1
Select the correct encoder for the input signal.
Solution:
Use an Audio Encoder such as Whisper or HuBERT to transform the raw audio waves into feature vectors.
Step 2
Apply a Projection Layer.
Solution:
Pass the audio feature vectors through a Multi-layer MLP or a Resampler to align them with the LLM's internal semantic space (dimension matching).
Step 3
Generate and Decode the output.
Solution:
The LLM processes the aligned tokens and outputs "Modality Signals" (continuous embeddings or discrete tokens). These signals are then passed to a 3D-specific decoder (e.g., a 3D Diffusion model) to generate the final 3D object.
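The three steps above can be strung together as a minimal sketch. Every component here is a placeholder: the frame rate, the hidden widths, and the random-output stand-ins for Whisper/HuBERT, the LLM, and the 3D diffusion decoder are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
D_AUD, D_LLM = 768, 4096           # hypothetical audio-encoder / LLM widths

def audio_encoder(waveform):
    """Step 1 stand-in for Whisper/HuBERT: raw waveform -> feature vectors."""
    n_frames = len(waveform) // 320        # assumed hop size of 320 samples
    return rng.standard_normal((n_frames, D_AUD))

def projection(features, W1, b1, W2, b2):
    """Step 2: two-layer MLP aligning audio features to the LLM's space."""
    return np.maximum(features @ W1 + b1, 0) @ W2 + b2

def llm_and_decode(llm_tokens):
    """Step 3 stand-in: the LLM emits a pooled continuous 'modality signal',
    which a 3D-specific decoder (e.g., a 3D diffusion model) turns into points."""
    signal = llm_tokens.mean(axis=0)                     # continuous embedding
    return rng.standard_normal((1024, 3)) + signal[:3]   # toy 1024-point cloud

W1 = rng.standard_normal((D_AUD, D_LLM)) * 0.02
W2 = rng.standard_normal((D_LLM, D_LLM)) * 0.02
b1, b2 = np.zeros(D_LLM), np.zeros(D_LLM)

waveform = rng.standard_normal(16000)            # 1 s of fake audio
feats = audio_encoder(waveform)                  # Step 1: encode
tokens = projection(feats, W1, b1, W2, b2)       # Step 2: project/align
obj = llm_and_decode(tokens)                     # Step 3: generate + decode
print(obj.shape)  # (1024, 3) point cloud
```

The shapes trace the data through the pipeline: audio frames become aligned LLM-space tokens, and the decoded output is a point cloud rather than text.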